Department of Mathematics, Mahidol University
Kit Tyabandha, PhD

Entropy and mutual information

4th November 2005

Definition 1. A probability space is a triple (S, B, P) on the domain S, which is a nonempty set
called the sample space, where (S, B) is a measurable space, B is a Borel field of subsets of S, and
P is a measure on S with the property that P(S) = 1 and, for all disjoint E_i ∈ B,

    P\left( \bigcup_{i=1}^{\infty} E_i \right) = \sum_{i=1}^{\infty} P(E_i)

In other words, P is a nonnegative function defined for all events E_i ∈ B, the B-measurable subsets
of S. Further, a random variable X is a function mapping S into some set R, called the range of X.
For convenience, we shall also use X to represent both the function and its own range, that is, X is
a function which maps S into X. If S is discrete and f is some real-valued function defined on S,
then X and f(X) are two different random variables, and the expectation of the latter is given by,

    E[f(X)] = \sum_{x} p(x) f(x)


Definition 2. Let p(x) be the probability that x ∈ X occurs, and similarly p(y) the probability that
y ∈ Y does, while p(x, y) is the probability that both x ∈ X and y ∈ Y occur. Then,

    p(x|y) = \frac{p(x,y)}{p(y)}    (1)

and

    p(y|x) = \frac{p(x,y)}{p(x)}    (2)
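As a quick numerical illustration of Equations 1 and 2, here is a minimal Python sketch; the joint
table p(x, y) in it is a made-up example, not one from the lecture.

    # Hypothetical joint distribution p(x, y) on X = {0, 1} and Y = {0, 1}
    p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

    # Marginals p(x) and p(y)
    p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
    p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

    # Equation 1: p(x|y) = p(x, y) / p(y), and Equation 2: p(y|x) = p(x, y) / p(x)
    p_x_given_y = {(x, y): p_xy[(x, y)] / p_y[y] for (x, y) in p_xy}
    p_y_given_x = {(x, y): p_xy[(x, y)] / p_x[x] for (x, y) in p_xy}

    print(p_x_given_y[(0, 1)])   # 0.1 / 0.4 = 0.25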


Definition 3. A Markov chain is a set of random variables X_t, where t = 0, 1, ..., such that,

    P(X_t = j | X_0 = i_0, ..., X_{t-1} = i_{t-1}) = P(X_t = j | X_{t-1} = i_{t-1})

In other words, given the present state, the next state is conditionally independent of the past. 

§ 
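A minimal Python sketch of this property; the two-state transition matrix below is an arbitrary
assumption, not part of the lecture. The next state is drawn using only the current state, which is
exactly the memorylessness of Definition 3.

    import random

    # Hypothetical transition probabilities P(X_t = j | X_{t-1} = i) for states {0, 1}
    P = {0: [0.9, 0.1],   # from state 0: stay with probability 0.9
         1: [0.5, 0.5]}   # from state 1: either state with probability 0.5

    def simulate(steps, start=0, seed=1):
        random.seed(seed)
        state, path = start, [start]
        for _ in range(steps):
            # only the current state is consulted; the earlier history is ignored
            state = random.choices([0, 1], weights=P[state])[0]
            path.append(state)
        return path

    print(simulate(10))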

Definition 4. A subset K ⊂ E^n, where E^n is the Euclidean space of n dimensions, is called
convex if the line segment joining any two points in K is contained in K. Let the two points be x_1
and x_2; then the line segment joining them is x = t x_1 + (1 - t) x_2, where 0 ≤ t ≤ 1.

§ 

Definition 5. A point x is said to be a convex combination of points x_1, ..., x_m if there exist
nonnegative scalars a_i such that \sum_{i=1}^{m} a_i = 1 and \sum_{i=1}^{m} a_i x_i = x. The set of all
convex combinations of x_i, i = 1, ..., m, is called the convex hull of {x_i}.

§ 

Definition 6. Let f be a real-valued function, and let K be a convex subset of the domain of f.
Then f is said to be convex cup if, for every x_1, x_2 ∈ K and 0 < t < 1,

    f(t x_1 + (1 - t) x_2) \le t f(x_1) + (1 - t) f(x_2)    (3)

It is said to be strictly convex cup if strict inequality holds in Equation 3 whenever x_1 ≠ x_2.
Similarly, f is said to be convex cap if,

    f(t x_1 + (1 - t) x_2) \ge t f(x_1) + (1 - t) f(x_2)    (4)

that is to say, if -f is convex cup. It is said to be strictly convex cap if strict inequality holds in
Equation 4 whenever x_1 ≠ x_2. Convex cap is also known as concave.


Geometrically speaking, f is convex cup if and only if all its chords lie above or on the graph of f,
and f is concave if and only if all its chords lie below or on the graph of the same.

§ 

Definition 7. Let K be some interval in E^1, and let F(x) be a probability distribution concentrated
on K such that P(X ≤ x) = F(x). If the expectation E(X) exists and f(x) is a convex cup function,
then,

    E(f(X)) \ge f(E(X))    (5)

If f is strictly convex cup, then strict inequality holds in Equation 5 unless X is concentrated at a
single point. Similarly, if f is convex cap, then,

    E(f(X)) \le f(E(X))    (6)

If f is strictly convex cap, then strict inequality holds in Equation 6 under the same proviso.

§ 

Example 1. Suppose that in Definition 7 there is a mass distribution placed on the graph of f.
Then Equation 5 says that the overall centre of mass will lie above or on the graph, while Equation
6 says that it will lie below or on it.
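A small numerical check of Jensen's inequality, Equation 5, using the convex cup function
f(x) = x^2 and an arbitrary made-up distribution:

    # Hypothetical distribution on three points, with f(x) = x**2, a convex cup function
    xs = [1.0, 2.0, 5.0]
    ps = [0.5, 0.3, 0.2]

    mean = sum(p * x for p, x in zip(ps, xs))          # E(X) = 2.1
    mean_f = sum(p * x**2 for p, x in zip(ps, xs))     # E(f(X)) = 6.7

    print(mean_f, mean**2)    # 6.7 >= 4.41, as Equation 5 requires
    assert mean_f >= mean**2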

Entropy is a measure of the uncertainty of many events as a single value. We derive it from
Axioms 1 and 2.

Axiom 1. If the events are all equally likely, then the uncertainty function H(1/m, ..., 1/m) is
monotonically increasing with m.

§ 

Axiom 2. If {E_1, ..., E_m} and {E'_1, ..., E'_n} are statistically independent sets of equally likely
disjoint events, then the uncertainty of the set of events {E_i ∩ E'_j : i = 1, ..., m; j = 1, ..., n} is
the sum of the uncertainties of the two sets,

    H\left( \frac{1}{mn}, ..., \frac{1}{mn} \right) = H\left( \frac{1}{m}, ..., \frac{1}{m} \right) + H\left( \frac{1}{n}, ..., \frac{1}{n} \right)

That is to say, h(mn) = h(m) + h(n), where h(m) = H(1/m, ..., 1/m).

§ 

Definition 8. Let the set of m possible disjoint events be E = {E_1, ..., E_m}, and let p(E_i) be the
a priori probability of E_i, where 1 ≤ i ≤ m and \sum_{i} p(E_i) = 1. The uncertainty function, or
entropy function, H(p(1), ..., p(m)) obeys Axioms 1 and 2.

§ 

The entropy of a random variable X gives a measure of the amount of information obtained
from an observation of X. It also represents the randomness of X and our uncertainty about X. The
less probable an event is, the more information we receive when it occurs.

Theorem 1. The entropy of a set of m equally likely events is h(m) = A \log_c m, where A is a
positive constant and c > 1.

Proof. Proving Theorem 1 amounts to proving that Axioms 1 and 2 are satisfied if and only if
h(m) = A \log_c m. The two axioms say that h(m) is monotonically increasing in m and

    h(mn) = h(m) + h(n)    (7)

According to Equation 7, if m = n = 1, then h(1) = h(1) + h(1), which implies that h(1) = 0.
From this together with both axioms above, h(m) = A \log_c m is sufficient as a solution.

Next, we must prove that this solution is necessarily the only one. Let a, b and c be
positive integers with a, b, c > 1. Then there exists a unique integer d such that

    c^d \le a^b < c^{d+1}    (8)

From Equation 8 it follows that,

    d \log c \le b \log a < (d + 1) \log c



and therefore,

    \frac{d}{b} \le \frac{\log a}{\log c} < \frac{d + 1}{b}

Since h(m) is monotonically increasing, from Equation 8 we have,

    h(c^d) \le h(a^b) \le h(c^{d+1})    (9)

Then from Equation 7, d h(c) \le b h(a) \le (d + 1) h(c), and since h(m) is monotonically increasing,
so that h(c) > 0,

    \frac{d}{b} \le \frac{h(a)}{h(c)} \le \frac{d + 1}{b}    (10)

Both (log a)/(log c) and h(a)/h(c) therefore lie in the interval [d/b, (d + 1)/b], and so,

    \left| \frac{h(a)}{h(c)} - \frac{\log a}{\log c} \right| \le \frac{1}{b}

And, since b is an arbitrary positive integer and may be taken arbitrarily large,

    \frac{h(a)}{h(c)} = \frac{\log a}{\log c}, \qquad \text{that is,} \qquad \frac{h(a)}{\log a} = \frac{h(c)}{\log c}

Since a and c are arbitrary,

    \frac{h(a)}{\log a} = A = \frac{h(c)}{\log c}

Therefore h(m) = A \log_c m is necessarily the only solution. †

Axiom 3. The total uncertainty of events does not depend on the method of indication. 

§ 


Axiom 4. The uncertainty measure is a continuous function of the probabilities involved.

§ 

Example 2. Let a set E of m disjoint events be {E_1, ..., E_m}. Let j_i, i = 0, ..., n, be integers
with 0 = j_0 < j_1 < j_2 < ... < j_n = m, and let E be divided into n groups of events, namely,

    G_1 = {E_1, ..., E_{j_1}}
    G_2 = {E_{j_1 + 1}, ..., E_{j_2}}
    ...
    G_n = {E_{j_{n-1} + 1}, ..., E_m}

If we indicate firstly the group, and then the event within that group, then the uncertainty becomes,

    H(p(E_1), ..., p(E_m)) = H(p(G_1), ..., p(G_n)) + \sum_{i=1}^{n} p(G_i) H(p(E_{j_{i-1}+1} | G_i), ..., p(E_{j_i} | G_i))    (11)

The grouping axiom, Axiom 3, lets us express the uncertainty when all the event probabilities
are rational. By grouping equally likely events together and then considering each of the groups as a
single event, it gives us the ability to deal with events which are not equally likely; Example 3 shows
how this is done. Axiom 4 then extends Axiom 3 to cover irrational probabilities as well, and
Equation 12 is the result.



Example 3. As in Example 2, let a set of disjoint events be E = {E_1, ..., E_m}, and let p(E_i) = 1/m,
i = 1, ..., m. Also, let the groups of events G_1, ..., G_n be defined in the same way as therein. Let n_k
be the number of events in G_k. Then n_k = j_k - j_{k-1} and p(G_k) = n_k / m, for k = 1, ..., n, and
also p(E_i | G_k) = 1/n_k for j_{k-1} < i ≤ j_k. Then Equation 11 yields,

    h(m) = H(p(G_1), ..., p(G_n)) + \sum_{i=1}^{n} p(G_i) h(n_i)

And since from Theorem 1, h(m) = A \log_c m, we have,

    H(p(G_1), ..., p(G_n)) = - \sum_{i=1}^{n} p(G_i) ( h(n_i) - h(m) )
                           = - \sum_{i=1}^{n} p(G_i) \left( A \log_c \frac{n_i}{m} \right)
                           = - A \sum_{i=1}^{n} p(G_i) \log_c p(G_i)    (12)

Example 4. From h(m) = A \log_c m, if we let A = \log_b c, then h(m) = \log_b m. In other words,
the scale factor A can be absorbed into the base of the logarithm.

Theorem 2. Let {p_1, ..., p_m} be a set of probabilities such that \sum_{i=1}^{m} p_i = 1. Then,†

    H(p_1, ..., p_m) = - \sum_{i=1}^{m} p_i \log p_i    (13)

Proof. This is the result of Examples 2 and 3, and the scale factor A disappears in a manner
similar to that shown in Example 4. †

† Sometimes the entropy function is defined instead by H(p_1, ..., p_m) = \sum_{i=1}^{m} p_i \log \frac{1}{p_i},
but this is obviously the same as our Equation 13, since \log x^{-1} = - \log x.

Example 5. If the base of the logarithm in Equation 13 is 2, the unit of the entropy is the bit. On
the other hand, if this base is e, that is to say, if we use natural logarithms, then the uncertainty has
the unit of the nat. From this, one may see that one nat is equal to log_2 e bits, which is approximately
1.443 bits. The term bit comes from binary digit, the term nat from natural digit.
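The entropy of Equation 13 is straightforward to compute. The Python sketch below (the
distribution in it is an arbitrary example) evaluates it in bits and in nats and checks the conversion
factor of Example 5.

    import math

    def entropy(ps, base=2):
        # H(p_1, ..., p_m) = -sum p_i log p_i; terms with p_i = 0 contribute nothing
        return -sum(p * math.log(p, base) for p in ps if p > 0)

    ps = [0.5, 0.25, 0.25]                                # hypothetical distribution
    print(entropy(ps, base=2))                            # 1.5 bits
    print(entropy(ps, base=math.e))                       # about 1.0397 nats
    print(entropy(ps, base=math.e) * math.log2(math.e))   # back to 1.5 bits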

Definition 9 explains what is meant by conditional entropy. Starting from Equation 14, which
is an equation for the conditional entropy when y is given, we obtain the overall conditional entropy
in Theorem 3. For any given pair X and Y, H(X|Y) gives the amount of uncertainty remaining
about X after Y has been observed.

Definition 9. The conditional entropy of X, given some y ∈ Y, is,

    H(X|y) = - \sum_{x} p(x|y) \log p(x|y)    (14)

Then the conditional entropy H(X|Y) is the expectation, or average value, of H(X|y) over the range
Y. In other words,

    H(X|Y) = \sum_{y} p(y) H(X|y)    (15)


Theorem 3. The conditional entropy is,

    H(X|Y) = - \sum_{x,y} p(x,y) \log p(x|y)



Proof. Putting the equation for the conditional entropy when y is given, Equation 14, into the
overall conditional entropy equation, Equation 15, we get,

    H(X|Y) = \sum_{y} p(y) H(X|y) = - \sum_{y} \sum_{x} p(y) p(x|y) \log p(x|y)

Then from Equation 1 of Definition 2, p(y) p(x|y) = p(x,y), and so,

    H(X|Y) = - \sum_{x,y} p(x,y) \log p(x|y)    †

Theorem 4. Let X, Y and Z be discrete random variables. For each z ∈ Z, let
E(z) = \sum_{x,y} p(y) p(z|x,y). Then,

    H(X|Y) \le H(Z) + E[\log E(Z)]

Proof.

    H(X|Y) = - E[\log p(x|y)]
           = - \sum_{x,y,z} p(x,y,z) \log p(x|y)
           = \sum_{z} p(z) \sum_{x,y} \frac{p(x,y,z)}{p(z)} \log \frac{1}{p(x|y)}

Because

    \frac{p(x,y,z)}{p(z)} = p(x,y|z)

is a probability distribution over the pairs (x, y) for each fixed z, and the logarithm is a convex cap
function, we may apply Equation 6, namely Jensen's inequality for convex cap, from Definition 7.
Hence,

    H(X|Y) \le \sum_{z} p(z) \log \sum_{x,y} \frac{p(x,y,z)}{p(z) p(x|y)}
            = \sum_{z} p(z) \log \frac{1}{p(z)} + \sum_{z} p(z) \log E(z)
            = H(Z) + E[\log E(Z)]

But,

    \frac{p(x,y,z)}{p(x|y)} = \frac{p(x,y,z) p(y)}{p(x,y)} = p(y) p(z|x,y)

so that \sum_{x,y} p(x,y,z)/p(x|y) = E(z), and hence the statement above is proved. †

Corollary 4.1. Let X and Y be random variables, each of which takes values in the set
{x_1, ..., x_r}. Let P_e = P(X ≠ Y). Then,

    H(X|Y) \le H(P_e) + P_e \log(r - 1)

Proof. In Theorem 4, let Z = 0 if X = Y, and let Z = 1 if X ≠ Y. Then E(0) = 1 and
E(1) = r - 1, so that H(Z) = H(P_e, 1 - P_e) and E[\log E(Z)] = P_e \log(r - 1). †
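A numerical sanity check of the corollary, which is Fano's inequality, using a made-up joint
distribution on r = 3 symbols in which errors are spread evenly over the wrong symbols:

    import math

    r = 3
    symbols = range(r)
    # Hypothetical joint distribution p(x, y): X equals Y with probability 0.8
    p_xy = {(x, y): (0.8 / r if x == y else 0.2 / (r * (r - 1)))
            for x in symbols for y in symbols}

    p_y = {y: sum(p_xy[(x, y)] for x in symbols) for y in symbols}
    h_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

    pe = sum(p for (x, y), p in p_xy.items() if x != y)          # P(X != Y) = 0.2
    h_pe = -pe * math.log2(pe) - (1 - pe) * math.log2(1 - pe)    # H(P_e)

    # H(X|Y) <= H(P_e) + P_e log(r - 1); for this symmetric example the two sides coincide
    print(h_x_given_y, h_pe + pe * math.log2(r - 1))
    assert h_x_given_y <= h_pe + pe * math.log2(r - 1) + 1e-12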

Theorem 5. The maximum uncertainty occurs when the events are equiprobable. 

Proof. Since,

    H\left( \frac{1}{m}, ..., \frac{1}{m} \right) - H(p_1, ..., p_m) = \log_b m + \sum_{i=1}^{m} p_i \log_b p_i
        = \log_b e \sum_{i=1}^{m} p_i \ln (m p_i)
        \ge \log_b e \sum_{i=1}^{m} p_i \left( 1 - \frac{1}{m p_i} \right) = 0

it being the case that \ln x \ge 1 - \frac{1}{x} for all x > 0 (cf. Examples 6 and 7). Therefore
H(p_1, ..., p_m) is maximised when p_i = \frac{1}{m} for all i = 1, ..., m. †
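A quick check of Theorem 5 for m = 4, using the same entropy helper as in the sketch after
Example 5; the non-uniform distribution here is arbitrary.

    import math

    def entropy(ps, base=2):
        return -sum(p * math.log(p, base) for p in ps if p > 0)

    uniform = [0.25] * 4              # equiprobable events: H = log2(4) = 2 bits
    skewed = [0.7, 0.1, 0.1, 0.1]     # hypothetical non-uniform distribution

    print(entropy(uniform), entropy(skewed))   # 2.0 versus about 1.357
    assert entropy(skewed) <= entropy(uniform)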




Example 6. Figure 1 shows that ln x ≤ x - 1, while Figure 2 shows that no such inequality holds
when the logarithm in question is of base 10.

Figure 1. Plots of ln x and x - 1, which show that ln x ≤ x - 1.

Figure 2. Graphs of y = log_{10} x and y = x - 1, which show that the latter is no bound for the
values of the former.

Example 7. Figure 3 confirms for us that ln(1/x) ≥ 1 - x, whereas Figure 4 tells us that this is not
the case for log_{10}(1/x).

Figure 3. Plots of ln(1/x) and 1 - x, which show that ln(1/x) ≥ 1 - x.



Figure 4. Graphs of y = log_{10}(1/x) and y = 1 - x, from which it is clear the latter gives no
bound for the former.




Example 8. Consider two events with probabilities p and 1 - p. The entropy function is then,

    H(p, 1 - p) = -p \log p - (1 - p) \log(1 - p)

Whenever the occurrence of either event becomes a certainty, the entropy function becomes zero.
Mathematically, we see that \lim_{p \to 0} p \log p = 0 and \lim_{p \to 1} (1 - p) \log(1 - p) = 0.
Figure 5 shows a plot of the values of the entropy function for two events. Base-2 logarithm is used here.

Figure 5. The entropy function of two events with probabilities p and 1 - p, namely
H(p, 1 - p) = -p \log_2 p - (1 - p) \log_2(1 - p).
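A short sketch evaluating the two-event entropy of Example 8 at a few values of p, showing the
endpoint behaviour and the maximum at p = 1/2:

    import math

    def h2(p):
        # binary entropy H(p, 1 - p) in bits; the p log p terms vanish at p = 0 and p = 1
        return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

    for p in (0.0, 0.1, 0.5, 0.9, 1.0):
        print(p, h2(p))    # 0, about 0.469, 1, about 0.469, 0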



Definition 10. The mutual information is I(X;Y) = H(X) - H(X|Y). It represents the
information provided about X by Y.

§ 


Example 9. Alternatively, the mutual information may take the following forms, cf Definition 2,

    I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x|y)}{p(x)}
           = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x) p(y)}
           = \sum_{x,y} p(x,y) \log \frac{p(y|x)}{p(y)}



That is to say, I(X;Y) is the average, taken over the X, Y sample space, of the random variable
I(x;y) such that,

    I(x;y) = \log \frac{p(x|y)}{p(x)} = \log \frac{p(x,y)}{p(x) p(y)} = \log \frac{p(y|x)}{p(y)}
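A sketch of the mutual information for the same hypothetical joint table used earlier: it evaluates
Definition 10 as H(X) - H(X|Y) and the symmetric form of Example 9, and the two agree.

    import math

    p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}   # hypothetical p(x, y)
    p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
    p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

    # Definition 10: I(X;Y) = H(X) - H(X|Y)
    h_x = -sum(p * math.log2(p) for p in p_x.values())
    h_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items())
    i_def10 = h_x - h_x_given_y

    # Example 9: I(X;Y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]
    i_ex9 = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

    print(i_def10, i_ex9)   # both are about 0.1245 bits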


Theorem 6. For any discrete random variables X and Y, I(X;Y) ≥ 0. Moreover, I(X;Y) = 0 if
and only if X and Y are independent.

Proof. From one of our formulae for the mutual information and from Jensen's inequality applied
to the convex cap function log,

    I(X;Y) = - \sum_{x,y} p(x,y) \log \frac{p(x) p(y)}{p(x,y)} \ge - \log \sum_{x,y} p(x) p(y) = - \log 1 = 0

Furthermore, the equality sign holds if and only if p(x) p(y) = p(x,y) for all x and y, that is to say,
when X and Y are independent of each other. †

Example 10. From our formulae for the mutual information, we may see that,

I(X; Y) = I(Y; X) 


and 

I(X; Y) = H(Y) - H(Y|X)

Also, 

I(X; Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x) p(y)}

Definition 11. Let X, Y and Z be three random variables. Then the mutual information
I(X,Y;Z) is given by,

    I(X,Y;Z) = E\left[ \log \frac{p(z|x,y)}{p(z)} \right] = \sum_{x,y,z} p(x,y,z) \log \frac{p(z|x,y)}{p(z)}

This mutual information is the amount of information X and Y provide about Z.

§


Theorem 7. Let X, Y and Z be three random variables. Then we have I(X,Y;Z) ≥ I(Y;Z),
where the equality holds if and only if p(z|x,y) = p(z|y) for all (x, y, z) such that p(x,y,z) > 0.

Proof.

    I(Y;Z) - I(X,Y;Z) = E\left[ \log \frac{p(z|y)}{p(z)} \right] - E\left[ \log \frac{p(z|x,y)}{p(z)} \right]
                      = E\left[ \log \frac{p(z|y)}{p(z|x,y)} \right]
                      = \sum_{x,y,z} p(x,y,z) \log \frac{p(z|y)}{p(z|x,y)}

Then using Jensen's inequality, we have,

    I(Y;Z) - I(X,Y;Z) \le \log \sum_{x,y,z} p(x,y,z) \frac{p(z|y)}{p(z|x,y)}
                       = \log \sum_{x,y,z} p(x,y) p(z|y) = \log 1 = 0    †



Theorem 8. Let (X, Y, Z) be a Markov chain. Then,

    I(X;Z) \le I(Y;Z)  and  I(X;Z) \le I(X;Y)

Proof. From Theorem 7, with the roles of X and Y interchanged, I(X;Z) ≤ I(X,Y;Z). Because
(X, Y, Z) is a Markov chain, I(X,Y;Z) = I(Y;Z). Therefore I(X;Z) ≤ I(Y;Z). Next, since (X, Y, Z)
is a Markov chain, (Z, Y, X) is also a Markov chain. Hence I(X;Z) ≤ I(X;Y). †
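A numerical check of Theorem 8 on a small made-up chain X → Y → Z, where Y is a noisy copy
of X and Z a noisy copy of Y; the transition probabilities below are arbitrary assumptions.

    import math
    import itertools

    def mutual_information(p_ab):
        # I(A;B) in bits, computed from a joint table {(a, b): probability}
        p_a, p_b = {}, {}
        for (a, b), p in p_ab.items():
            p_a[a] = p_a.get(a, 0.0) + p
            p_b[b] = p_b.get(b, 0.0) + p
        return sum(p * math.log2(p / (p_a[a] * p_b[b])) for (a, b), p in p_ab.items() if p > 0)

    # Hypothetical Markov chain on binary variables: X -> Y -> Z
    p_x = {0: 0.5, 1: 0.5}
    p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # first channel
    p_z_given_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}   # second channel

    p_xyz = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
             for x, y, z in itertools.product((0, 1), repeat=3)}

    def pair_marginal(i, j):
        # joint table of coordinates i and j of the triple (x, y, z)
        out = {}
        for key, p in p_xyz.items():
            out[(key[i], key[j])] = out.get((key[i], key[j]), 0.0) + p
        return out

    i_xz = mutual_information(pair_marginal(0, 2))
    i_xy = mutual_information(pair_marginal(0, 1))
    i_yz = mutual_information(pair_marginal(1, 2))
    print(i_xz, i_xy, i_yz)            # I(X;Z) does not exceed I(X;Y) or I(Y;Z)
    assert i_xz <= i_xy and i_xz <= i_yz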



